feat(trainer): Support namespaced TrainingRuntime in the SDK #130

shaikmoeed · 2025-10-29T07:35:52Z

What this PR does / why we need it:
Add support to list/get namespaced TrainingRuntime.

Which issue(s) this PR fixes:

Fixes #88

Checklist:

Docs included if any changes are user facing

google-oss-prow · 2025-10-29T07:35:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kramaranya · 2025-10-29T07:51:23Z

/ok-to-test

abhijeet-dhumal · 2025-11-03T19:21:11Z

Thank you @shaikmoeed for this!
Left a few nitpicks.

abhijeet-dhumal · 2025-11-03T19:21:56Z

kubeflow/trainer/backends/kubernetes/backend.py


    def get_runtime(self, name: str) -> types.Runtime:
-        """Get the the Runtime object"""
+        """Get the the Runtime object prefer namespaced, fall-back to cluster-scoped"""


Suggested change

"""Get the the Runtime object prefer namespaced, fall-back to cluster-scoped"""

"""Get the Runtime object prefer namespaced, fall-back to cluster-scoped"""

The same change applies to each occurrence here.

abhijeet-dhumal · 2025-11-03T19:29:29Z

kubeflow/trainer/backends/kubernetes/backend.py

+            )
+
+            cluster_runtime_list = models.TrainerV1alpha1ClusterTrainingRuntimeList.from_dict(
+                cluster_thread.get(constants.DEFAULT_TIMEOUT)


Suggested change

cluster_thread.get(constants.DEFAULT_TIMEOUT)

cluster_thread.get(common_constants.DEFAULT_TIMEOUT)

Done! This was the reason for the test case failures.

szaher · 2025-11-04T11:31:36Z

kubeflow/trainer/backends/kubernetes/backend_test.py

+def create_training_runtime(
+    name: str,
+    namespace: str = "default",
+) -> models.TrainerV1alpha1TrainingRuntime:
+    """Create a mock namespaced TrainingRuntime object (not cluster-scoped)."""
+    return models.TrainerV1alpha1TrainingRuntime(
+        apiVersion=constants.API_VERSION,
+        kind="TrainingRuntime",
+        metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
+            name=name,
+            namespace=namespace,
+            labels={constants.RUNTIME_FRAMEWORK_LABEL: name},
+        ),
+        spec=models.TrainerV1alpha1TrainingRuntimeSpec(
+            mlPolicy=models.TrainerV1alpha1MLPolicy(
+                torch=models.TrainerV1alpha1TorchMLPolicySource(
+                    numProcPerNode=models.IoK8sApimachineryPkgUtilIntstrIntOrString(2)
+                ),
+                numNodes=2,
+            ),
+            template=models.TrainerV1alpha1JobSetTemplateSpec(
+                metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(
+                    name=name,
+                    namespace=namespace,
+                ),
+                spec=models.JobsetV1alpha2JobSetSpec(replicatedJobs=[get_replicated_job()]),
+            ),
+        ),
+    )
+
+


Did you mean to create this in kubernetes/backend_test.py?
This is not a test function, and I believe it should be added to the TrainerClient and propagated to the different backends.

Similar to create_train_job, I intended to pass the result of create_training_runtime here instead of an empty list. Please correct me if this was not intended.

coveralls · 2025-11-19T18:00:14Z

Pull Request Test Coverage Report for Build 19512318478

Details

22 of 25 (88.0%) changed or added relevant lines in 3 files are covered.
5 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.04%) to 66.631%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
kubeflow/trainer/backends/kubernetes/backend.py	15	18	83.33%

Files with Coverage Reduction	New Missed Lines	%
kubeflow/trainer/backends/kubernetes/backend.py	5	79.05%

Totals
Change from base Build 19462635145:	0.04%
Covered Lines:	2518
Relevant Lines:	3779

💛 - Coveralls

shaikmoeed · 2025-11-19T18:38:59Z

@abhijeet-dhumal, can you review it again during your free time?

kramaranya

Thank you @shaikmoeed!
I'd like all of us to spend more time designing this properly, since there are a lot of items to consider.
/assign @kubeflow/kubeflow-sdk-team

kramaranya · 2025-11-21T13:22:48Z

kubeflow/trainer/backends/kubernetes/backend.py

        except multiprocessing.TimeoutError as e:
-            raise TimeoutError(f"Timeout to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e
+            raise TimeoutError(
+                "Timeout to list "
+                f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s "
+                f"in namespace: {self.namespace}"
+            ) from e
        except Exception as e:
-            raise RuntimeError(f"Failed to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e
+            raise RuntimeError(
+                "Failed to list "
+                f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s "
+                f"in namespace: {self.namespace}"
+            ) from e


What if only cluster runtimes exist (no TrainingRuntime CRD)? The entire list_runtimes call will fail with RuntimeError, which we don't want, right?

kramaranya · 2025-11-21T13:24:39Z

kubeflow/trainer/backends/kubernetes/backend.py

        except multiprocessing.TimeoutError as e:
-            raise TimeoutError(f"Timeout to list {constants.CLUSTER_TRAINING_RUNTIME_KIND}s") from e
+            raise TimeoutError(
+                "Timeout to list "
+                f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s "
+                f"in namespace: {self.namespace}"
+            ) from e


What if one type times out and the other succeeds? I think we should still return partial results.
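One way to get partial results is to wait on each listing separately and record failures instead of raising on the first one. The sketch below is only an illustration under stated assumptions: `list_with_partial_results` and the `fetchers` mapping are hypothetical names, not part of the SDK, and the fetchers stand in for the two list calls in backend.py.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError
from typing import Callable

logger = logging.getLogger(__name__)


def list_with_partial_results(
    fetchers: dict[str, Callable[[], list[str]]],
    timeout: float,
) -> tuple[list[str], dict[str, Exception]]:
    """Run each fetcher in its own thread and keep whatever succeeds.

    `fetchers` maps a kind name to a zero-argument callable that returns the
    items of that kind (hypothetical shape, not the SDK's API).
    """
    items: list[str] = []
    errors: dict[str, Exception] = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(fetchers)))
    try:
        futures = {kind: pool.submit(fn) for kind, fn in fetchers.items()}
        for kind, future in futures.items():
            try:
                # The timeout applies to each wait separately.
                items.extend(future.result(timeout=timeout))
            except FutureTimeoutError as e:
                logger.warning("Timeout to list %ss", kind)
                errors[kind] = e
            except Exception as e:
                logger.warning("Failed to list %ss: %s", kind, e)
                errors[kind] = e
    finally:
        # Do not block the caller on a fetcher that is still running.
        pool.shutdown(wait=False)
    return items, errors
```

The caller can then decide what counts as a hard failure, for example raising only when every kind ended up in `errors`.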

kramaranya · 2025-11-21T13:27:02Z

kubeflow/trainer/backends/kubernetes/backend.py

+                "Timeout to list "
+                f"{constants.CLUSTER_TRAINING_RUNTIME_KIND}s/{constants.TRAINING_RUNTIME_KIND}s "
+                f"in namespace: {self.namespace}"


nit: cluster-scoped runtimes are not really in a namespace

kramaranya · 2025-11-21T13:37:36Z

kubeflow/trainer/backends/kubernetes/backend.py

+            except Exception as e:
+                logger.warning(
+                    f"Namespaced TrainingRuntime '{self.namespace}/{name}' not found "
+                    f"({type(e).__name__}: {e}); falling back to cluster-scoped runtime."
+                )


What if it was, for example, a TimeoutError? We would still silently fall back to cluster-scoped runtimes, right? I would suggest treating only not found / missing CRD as a reason to fall back.
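A rough sketch of that narrower fallback, assuming the API server reports both a missing object and a missing CRD as HTTP 404. `ApiError` is a hypothetical stand-in for the client's exception type, and `get_with_fallback` is an illustrative helper, not the SDK's get_runtime().

```python
from typing import Callable


class ApiError(Exception):
    """Hypothetical stand-in for an API client error carrying an HTTP status."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"{status} {reason}".strip())
        self.status = status


def get_with_fallback(
    get_namespaced: Callable[[], dict],
    get_cluster: Callable[[], dict],
) -> dict:
    """Prefer the namespaced object; fall back only when it does not exist."""
    try:
        return get_namespaced()
    except ApiError as e:
        # Assumption: 404 covers both "object not found" and "CRD not installed".
        if e.status != 404:
            raise
    # Timeouts and any other error types were never caught, so they propagate.
    return get_cluster()
```

With this shape, a timeout or a permission error on the namespaced lookup surfaces to the caller instead of being masked by the cluster-scoped result.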

kramaranya · 2025-11-21T13:38:44Z

kubeflow/trainer/backends/kubernetes/backend.py


        return result

    def get_runtime(self, name: str) -> types.Runtime:


Similar to list_runtimes(), either case will cause a full failure

kramaranya · 2025-11-21T13:58:06Z

kubeflow/trainer/constants/constants.py

+# The Kind name for the TrainingRuntime.
+TRAINING_RUNTIME_KIND = "TrainingRuntime"
+
+# The plural for the ClusterTrainingRuntime.


Suggested change

# The plural for the ClusterTrainingRuntime.

# The plural for the TrainingRuntime.

abhijeet-dhumal · 2025-11-24T06:49:12Z

kubeflow/trainer/backends/kubernetes/backend.py

                return result

-            for runtime in runtime_list.items:
+            for runtime in namespace_runtime_list.items + cluster_runtime_list.items:


@shaikmoeed
Quick question: what if runtimes with the same name exist in both cluster and namespace scope?
IIUC, you have implemented namespace-scoped priority in get_runtime(), where TrainingRuntimes get first priority.

Should the same apply to the list_runtimes method too?
One more thing: list_runtimes just appends both lists, even if that produces duplicates.
So how will an end user know the kind of each runtime in the list returned by list_runtimes?

What if we introduce kind and namespace (optional) params in the Runtime dataclass here:

sdk/kubeflow/trainer/types/types.py

Line 251 in 49a5087

class Runtime:

So that

    for runtime in runtimes:
        print(f"{runtime.name} ({runtime.kind}, ns={runtime.namespace})")

    # Output:
    # torch-runtime (TrainingRuntime, ns=team-a)
    # torch-runtime (ClusterTrainingRuntime, ns=None)
    # custom-runtime (TrainingRuntime, ns=team-a)

WDYT? @kramaranya @szaher ?
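A minimal sketch of what the proposal could look like, assuming a trimmed-down record with only the fields under discussion. `RuntimeRef` and `prefer_namespaced` are hypothetical names for illustration; the real Runtime dataclass in types.py has other fields that are omitted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeRef:
    """Hypothetical, trimmed-down stand-in for the SDK's Runtime dataclass."""

    name: str
    kind: str
    # None marks a cluster-scoped runtime.
    namespace: Optional[str] = None


def prefer_namespaced(runtimes: list[RuntimeRef]) -> list[RuntimeRef]:
    """Drop a cluster-scoped runtime when a namespaced one shares its name."""
    namespaced_names = {r.name for r in runtimes if r.namespace is not None}
    return [
        r
        for r in runtimes
        if r.namespace is not None or r.name not in namespaced_names
    ]
```

Exposing `kind` and `namespace` lets list_runtimes return both entries unchanged, while a helper like `prefer_namespaced` would give it the same priority rule as get_runtime() if de-duplication turns out to be the preferred behavior.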

google-oss-prow bot requested review from kramaranya and szaher October 29, 2025 07:35

google-oss-prow bot added the size/L label Oct 29, 2025

shaikmoeed force-pushed the fix/namespace-trainingruntime-list branch from 24f00a7 to 1740535 Compare October 29, 2025 07:36

shaikmoeed changed the title ~~feat(backend): Support namespaced TrainingRuntime in the SDK~~ feat(trainer): Support namespaced TrainingRuntime in the SDK Oct 29, 2025

google-oss-prow bot added the ok-to-test label Oct 29, 2025

shaikmoeed mentioned this pull request Oct 30, 2025

Support namespaced TrainingRuntime in the SDK #88

Open

andreyvelich mentioned this pull request Nov 3, 2025

feat: Implement Training Options pattern for flexible TrainJob customization #91

Merged

1 task

shaikmoeed added 2 commits November 3, 2025 13:50

feat(backend): Support namespaced TrainingRuntime in the SDK

c91aef7

Signed-off-by: Moeed Shaik <[email protected]>

Fixed bugs and validated current test cases

de2ad1b

Signed-off-by: Moeed Shaik <[email protected]>

shaikmoeed force-pushed the fix/namespace-trainingruntime-list branch from 8f0b6d5 to de2ad1b Compare November 3, 2025 13:51

Fixed pre-commit test failure

b569164

Signed-off-by: Moeed Shaik <[email protected]>

abhijeet-dhumal reviewed Nov 3, 2025

szaher reviewed Nov 4, 2025

shaikmoeed added 2 commits November 19, 2025 17:29

Addressed comments

020e475

Signed-off-by: Moeed Shaik <[email protected]>

Fixed no attribute 'DEFAULT_TIMEOUT' error

32b18fd

Signed-off-by: Moeed Shaik <[email protected]>

shaikmoeed force-pushed the fix/namespace-trainingruntime-list branch from d1ca707 to 32b18fd Compare November 19, 2025 17:56

Merge branch 'main' into fix/namespace-trainingruntime-list

90382fc

Signed-off-by: Moeed <[email protected]>

Added namespace-scoped runtime to test cases

35206fe

Signed-off-by: Moeed Shaik <[email protected]>

kramaranya reviewed Nov 22, 2025

abhijeet-dhumal reviewed Nov 24, 2025

5 participants

abhijeet-dhumal commented Nov 3, 2025 •

edited

Loading

shaikmoeed Nov 19, 2025 •

edited

Loading

coveralls commented Nov 19, 2025 •

edited

Loading